This was actually my original plan before I started. Since the row-normalized sparse matrix can be viewed as a conditional probability matrix, I believed I could take it to be the conditional probability matrix that t-SNE creates and uses internally. That would obviate the need to reduce dimension before handing things to t-SNE, since no actual distance computations would be required: I would already have the similarity/conditional-probability matrix. This ran into some difficulties, which I will discuss below.
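To make that concrete, here is a minimal sketch (mine, not part of the pipeline below) of what "view the row-normalized matrix as conditional probabilities" means, using a toy count matrix:

import numpy as np
import scipy.sparse as ss
from sklearn.preprocessing import normalize

# toy co-occurrence counts between three items
toy_counts = ss.csr_matrix(np.array([[0., 4., 1.],
                                     [4., 0., 2.],
                                     [1., 2., 0.]]))
# l1-normalize each row: row i then holds P(j | i), the conditional
# probability of seeing item j given that we saw item i
cond_prob = normalize(toy_counts, norm='l1', axis=1)
print(cond_prob.toarray())  # each row sums to 1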
In [1]:
import pandas as pd
import scipy.sparse as ss
import numpy as np
import sklearn.manifold
import re
In [2]:
raw_data = pd.read_csv('subreddit-overlap')
In [3]:
raw_data.head()
Out[3]:
In [4]:
subreddit_popularity = raw_data.groupby('t2_subreddit')['NumOverlaps'].sum()
subreddits = np.array(subreddit_popularity.sort_values(ascending=False).index)
In [5]:
# map each subreddit name to an integer index (most popular first); a dict
# comprehension keeps the indices as ints rather than the strings that
# np.vstack on a mixed string/int array would produce
index_map = {subreddit: i for i, subreddit in enumerate(subreddits)}
In [6]:
count_matrix = ss.coo_matrix((raw_data.NumOverlaps,
                              (raw_data.t2_subreddit.map(index_map),
                               raw_data.t1_subreddit.map(index_map))),
                             shape=(subreddits.shape[0], subreddits.shape[0]),
                             dtype=np.float64)
In [7]:
count_matrix
Out[7]:
Everything proceeds as normal up to this point ... but now, instead of using truncated SVD to reduce the vectors, I was going to massage the count_matrix into the joint probability matrix that t-SNE uses internally, then reach into scikit-learn's t-SNE implementation a little to hand it that matrix and let t-SNE proceed from there.
The obvious approach is to just l1-normalize the rows, call that the conditional probability matrix, and then build the joint matrix by adding the transpose and normalizing. That didn't work so well. The catch is t-SNE's use of varying kernel widths depending on the local density around each point. I spent some time playing with various ways to emulate that given the data I had, and you can see the results below. In effect the goal is to convert counts to distances by inverting them, normalize each row by the distance to its 50th nearest neighbor, and then convert back to similarities via an RBF kernel. We can get the joint matrix by taking the elementwise geometric mean of the conditional matrix and its transpose (there are reasons why this is a more correct choice than the arithmetic mean that t-SNE uses), and proceed from there; the whole transformation is sketched as a single function below.
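For reference, here is the whole transformation collected into one function. This is just a minimal sketch of what the next several cells do step by step (the name counts_to_joint_prob is mine, it ignores the restriction to the top 10000 subreddits, and it assumes every row has more than k nonzero entries); the cells below are the code that was actually run.

def counts_to_joint_prob(counts, k=50, threshold=0.25):
    # counts: a square sparse matrix of overlap counts
    dists = counts.tocsr().astype(np.float64)
    dists.data = 1.0 / dists.data               # invert: big overlap -> small distance
    dists = dists.tolil()
    for i, row in enumerate(dists.data):        # rescale each row by the distance to
        sigma = np.sort(row)[k]                 # its (k+1)-th smallest entry
        dists.data[i] = [d / sigma for d in row]
    sims = dists.tocsr()
    sims.data = np.exp(-sims.data ** 2)         # RBF kernel: distances -> similarities
    sims.data[sims.data < threshold] = 0.0      # drop weak similarities
    sims.eliminate_zeros()
    joint = sims.multiply(sims.T).sqrt().tocsr()  # elementwise geometric mean
    return joint / joint.sum()                  # normalize to a joint probability matrix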
In [8]:
# invert the counts: large overlaps become small "distances"
count_matrix.data = 1.0 / count_matrix.data
count_matrix.data
Out[8]:
In [9]:
# LIL format allows efficient row-wise access and modification
count_matrix = count_matrix.tolil()
In [10]:
# for each of the top 10000 subreddits find the distance to its
# 50th nearest neighbor, to use as a per-row normalization
normalizing_values = np.ones(10000)
for i, row in enumerate(count_matrix.data[:10000]):
    normalizing_values[i] = np.sort(row)[50]
normalizing_values
Out[10]:
In [11]:
# rescale each row by its normalizing value
for i, row in enumerate(count_matrix.data[:10000]):
    for j in range(len(row)):
        count_matrix.data[i][j] /= normalizing_values[i]
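As an aside, the same per-row rescaling can be done without explicit Python loops by multiplying with a sparse diagonal matrix. A minimal sketch (hypothetical variable name, and assuming the matrix had not already been modified in place by the loop above):

rescaled = ss.diags(1.0 / normalizing_values).dot(count_matrix.tocsr()[:10000, :])
# row i of `rescaled` is row i of the original divided by normalizing_values[i]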
In [12]:
# restrict to the top 10000 subreddits
count_matrix = count_matrix.tocsr()[:10000,:][:,:10000]
In [13]:
# RBF kernel: convert the normalized distances back into similarities
count_matrix.data = np.exp(-count_matrix.data**2)
In [14]:
# drop weak similarities to keep the matrix sparse
count_matrix.data[count_matrix.data < 0.25] = 0.0
count_matrix.eliminate_zeros()
count_matrix
Out[14]:
Now we just convert the result of all this messing around to a joint probability matrix via a similar approach to the one t-SNE uses ...
In [15]:
# elementwise geometric mean of the matrix and its transpose (note that *
# on scipy sparse matrices is matrix multiplication, so we need .multiply()
# here), then normalize into a joint probability matrix
joint_prob_matrix = count_matrix.multiply(count_matrix.T).sqrt()
joint_prob_matrix = joint_prob_matrix / joint_prob_matrix.sum()
joint_prob_ndarray = joint_prob_matrix.toarray()
# zero out the diagonal -- t-SNE ignores self-similarity
diag = np.arange(joint_prob_ndarray.shape[0])
joint_prob_ndarray[diag, diag] = 0.0
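A quick sanity check (mine, not in the original run): a joint probability matrix suitable for t-SNE should be symmetric, non-negative, and sum to one.

assert np.allclose(joint_prob_ndarray, joint_prob_ndarray.T)   # symmetric
assert joint_prob_ndarray.min() >= 0.0                         # non-negative
# sums to 1 up to whatever (tiny) mass was removed with the diagonal
assert np.isclose(joint_prob_ndarray.sum(), 1.0, atol=1e-2)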
In [16]:
# for each point record its 150 nearest neighbors (by joint probability)
neighbors = []
for row in joint_prob_ndarray:
    neighbors.append((np.argsort(row)[-150:])[::-1])
neighbors = np.array(neighbors)
In [17]:
neighbors
Out[17]:
Now we need to hand our joint probability matrix to t-SNE and have it work with that. This isn't so hard since the scikit-learn t-SNE code is well structured: the matrix generation is separated from the optimization well enough that I can instantiate a TSNE object and then reach into one of the private methods (handing it a suitable transformation of the joint probability matrix) and let it run.
In [18]:
# squareform here is scipy's, re-exported through the t_sne module; note that
# _tsne is a private method, so this depends on the internals of this
# particular scikit-learn version
P = sklearn.manifold.t_sne.squareform(joint_prob_ndarray)
embedder = sklearn.manifold.TSNE(perplexity=50.0,
                                 init='pca',
                                 n_iter=2000,
                                 n_iter_without_progress=60)
random_state = sklearn.manifold.t_sne.check_random_state(embedder.random_state)
subreddit_map = embedder._tsne(P, 1, joint_prob_ndarray.shape[0], random_state,
                               neighbors=neighbors)
Everything after this works exactly as normal ...
In [19]:
subreddit_map_df = pd.DataFrame(subreddit_map[:10000], columns=('x', 'y'))
subreddit_map_df['subreddit'] = subreddits[:10000]
subreddit_map_df.head()
Out[19]:
In [20]:
import hdbscan
In [21]:
clusterer = hdbscan.HDBSCAN(min_samples=5,
                            min_cluster_size=20).fit(subreddit_map[:10000])
cluster_ids = clusterer.labels_
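Before plotting, it can be handy to see how much structure HDBSCAN actually found; a small inspection snippet (mine, not part of the original analysis):

n_clusters = cluster_ids.max() + 1
n_noise = (cluster_ids == -1).sum()
print('HDBSCAN found {} clusters; {} points labelled noise'.format(n_clusters, n_noise))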
In [22]:
subreddit_map_df['cluster_id'] = cluster_ids
In [23]:
from bokeh.plotting import figure, show, output_notebook, output_file
from bokeh.models import HoverTool, ColumnDataSource, value
from bokeh.models.mappers import LinearColorMapper, CategoricalColorMapper
from bokeh.palettes import viridis
from collections import OrderedDict
output_notebook()
In [24]:
color_mapper = LinearColorMapper(palette=viridis(256), low=0, high=cluster_ids.max())
color_dict = {'field': 'cluster_id', 'transform': color_mapper}
plot_data_clusters = ColumnDataSource(subreddit_map_df[subreddit_map_df.cluster_id >= 0])
plot_data_noise = ColumnDataSource(subreddit_map_df[subreddit_map_df.cluster_id < 0])

tsne_plot = figure(title=u'A Map of Subreddits',
                   plot_width=700,
                   plot_height=700,
                   tools=(u'pan, wheel_zoom, box_zoom,'
                          u'box_select, resize, reset'),
                   active_scroll=u'wheel_zoom')
tsne_plot.add_tools(HoverTool(tooltips=OrderedDict([('subreddit', '@subreddit'),
                                                    ('cluster', '@cluster_id')])))

# draw clusters
tsne_plot.circle(u'x', u'y', source=plot_data_clusters,
                 fill_color=color_dict, line_alpha=0.002, fill_alpha=0.1,
                 size=10, hover_line_color=u'black')

# draw noise
tsne_plot.circle(u'x', u'y', source=plot_data_noise,
                 fill_color=u'gray', line_alpha=0.002, fill_alpha=0.05,
                 size=10, hover_line_color=u'black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

show(tsne_plot);
As you can see the results don't look as good -- although if you go back up and remove layers of the manipulations I performed on the conditional probability matrix and rerun things, you'll see how much worse they get. I still believe this idea has merit, but making it work in practice means going back to the drawing board to determine how to correctly convert the count matrix into a suitable conditional probability matrix; hacking around, as I was doing here, will not cut it.
Finally, as usual for the experimental notebooks, I look at the actual content of the clusters.
In [25]:
def is_nsfw(subreddit):
    return re.search(r'(nsfw|gonewild)', subreddit)

for cid in range(cluster_ids.max() + 1):
    # use a local name so we don't clobber the global subreddits array
    cluster_subreddits = subreddit_map_df.subreddit[cluster_ids == cid]
    if np.any(cluster_subreddits.map(is_nsfw)):
        cluster_subreddits = ' ... Censored ...'
    else:
        cluster_subreddits = cluster_subreddits.values
    print '\nCluster {}:\n{}\n'.format(cid, cluster_subreddits)